{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Tce3stUlHN0L"
      },
      "source": [
        "##### Copyright 2024 Google LLC."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "tuOe1ymfHZPu"
      },
      "outputs": [],
      "source": [
        "# @title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "dfsDR_omdNea"
      },
      "source": [
        "# Gemma - deploy with vLLM\n",
        "\n",
        "This notebook demonstrates how you can deploy a Gemma model with [vLLM](https://github.com/vllm-project/vllm) and query it. vLLM is a fast and easy-to-use library for LLM inference and serving, and has built-in support for Gemma deployment.\n",
        "\n",
        "<table align=\"left\">\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/google-gemini/gemma-cookbook/blob/main/Gemma/[Gemma_2]Deploy_with_vLLM.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
        "  </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MwMiP7jDdAL1"
      },
      "source": [
        "## Setup\n",
        "\n",
        "### Select the Colab runtime\n",
        "To complete this tutorial, you'll need to have a Colab runtime with sufficient resources to run the Gemma model. In this case, you can use a T4 GPU:\n",
        "\n",
        "1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.\n",
        "2. Select **Change runtime type**.\n",
        "3. Under **Hardware accelerator**, select **T4 GPU**.\n",
        "\n",
        "\n",
        "### Gemma setup on Hugging Face\n",
        "vLLM uses Hugging Face under the hood. So you will need to:\n",
        "\n",
        "* Get access to Gemma on [huggingface.co](huggingface.co) by accepting the Gemma license on the Hugging Face page of the specific model, i.e., [Gemma 2B](https://huggingface.co/google/gemma-2b).\n",
        "* Generate a [Hugging Face access token](https://huggingface.co/docs/hub/en/security-tokens) and configure it as a Colab secret 'HF_TOKEN'."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gJaQ-OVoPKCo"
      },
      "source": [
        "## Installation\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "DHrOMaOAPSAM"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Requirement already satisfied: vllm in /usr/local/lib/python3.10/dist-packages (0.4.2)\n",
            "Requirement already satisfied: cmake>=3.21 in /usr/local/lib/python3.10/dist-packages (from vllm) (3.27.9)\n",
            "Requirement already satisfied: ninja in /usr/local/lib/python3.10/dist-packages (from vllm) (1.11.1.1)\n",
            "Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from vllm) (5.9.5)\n",
            "Requirement already satisfied: sentencepiece in /usr/local/lib/python3.10/dist-packages (from vllm) (0.1.99)\n",
            "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from vllm) (1.25.2)\n",
            "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from vllm) (2.31.0)\n",
            "Requirement already satisfied: py-cpuinfo in /usr/local/lib/python3.10/dist-packages (from vllm) (9.0.0)\n",
            "Requirement already satisfied: transformers>=4.40.0 in /usr/local/lib/python3.10/dist-packages (from vllm) (4.41.1)\n",
            "Requirement already satisfied: tokenizers>=0.19.1 in /usr/local/lib/python3.10/dist-packages (from vllm) (0.19.1)\n",
            "Requirement already satisfied: fastapi in /usr/local/lib/python3.10/dist-packages (from vllm) (0.111.0)\n",
            "Requirement already satisfied: openai in /usr/local/lib/python3.10/dist-packages (from vllm) (1.30.2)\n",
            "Requirement already satisfied: uvicorn[standard] in /usr/local/lib/python3.10/dist-packages (from vllm) (0.29.0)\n",
            "Requirement already satisfied: pydantic>=2.0 in /usr/local/lib/python3.10/dist-packages (from vllm) (2.7.1)\n",
            "Requirement already satisfied: prometheus-client>=0.18.0 in /usr/local/lib/python3.10/dist-packages (from vllm) (0.20.0)\n",
            "Requirement already satisfied: prometheus-fastapi-instrumentator>=7.0.0 in /usr/local/lib/python3.10/dist-packages (from vllm) (7.0.0)\n",
            "Requirement already satisfied: tiktoken==0.6.0 in /usr/local/lib/python3.10/dist-packages (from vllm) (0.6.0)\n",
            "Requirement already satisfied: lm-format-enforcer==0.9.8 in /usr/local/lib/python3.10/dist-packages (from vllm) (0.9.8)\n",
            "Requirement already satisfied: outlines==0.0.34 in /usr/local/lib/python3.10/dist-packages (from vllm) (0.0.34)\n",
            "Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from vllm) (4.11.0)\n",
            "Requirement already satisfied: filelock>=3.10.4 in /usr/local/lib/python3.10/dist-packages (from vllm) (3.14.0)\n",
            "Requirement already satisfied: ray>=2.9 in /usr/local/lib/python3.10/dist-packages (from vllm) (2.23.0)\n",
            "Requirement already satisfied: nvidia-ml-py in /usr/local/lib/python3.10/dist-packages (from vllm) (12.550.52)\n",
            "Requirement already satisfied: vllm-nccl-cu12<2.19,>=2.18 in /usr/local/lib/python3.10/dist-packages (from vllm) (2.18.1.0.4.0)\n",
            "Requirement already satisfied: torch==2.3.0 in /usr/local/lib/python3.10/dist-packages (from vllm) (2.3.0+cu121)\n",
            "Requirement already satisfied: xformers==0.0.26.post1 in /usr/local/lib/python3.10/dist-packages (from vllm) (0.0.26.post1)\n",
            "Requirement already satisfied: interegular>=0.3.2 in /usr/local/lib/python3.10/dist-packages (from lm-format-enforcer==0.9.8->vllm) (0.3.3)\n",
            "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from lm-format-enforcer==0.9.8->vllm) (24.0)\n",
            "Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from lm-format-enforcer==0.9.8->vllm) (6.0.1)\n",
            "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (3.1.4)\n",
            "Requirement already satisfied: lark in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (1.1.9)\n",
            "Requirement already satisfied: nest-asyncio in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (1.6.0)\n",
            "Requirement already satisfied: cloudpickle in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (2.2.1)\n",
            "Requirement already satisfied: diskcache in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (5.6.3)\n",
            "Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (1.11.4)\n",
            "Requirement already satisfied: numba in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (0.58.1)\n",
            "Requirement already satisfied: joblib in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (1.4.2)\n",
            "Requirement already satisfied: referencing in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (0.35.1)\n",
            "Requirement already satisfied: jsonschema in /usr/local/lib/python3.10/dist-packages (from outlines==0.0.34->vllm) (4.19.2)\n",
            "Requirement already satisfied: regex>=2022.1.18 in /usr/local/lib/python3.10/dist-packages (from tiktoken==0.6.0->vllm) (2023.12.25)\n",
            "Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (1.12)\n",
            "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (3.3)\n",
            "Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (2023.6.0)\n",
            "Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (12.1.105)\n",
            "Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (12.1.105)\n",
            "Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (12.1.105)\n",
            "Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (8.9.2.26)\n",
            "Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (12.1.3.1)\n",
            "Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (11.0.2.54)\n",
            "Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (10.3.2.106)\n",
            "Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (11.4.5.107)\n",
            "Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (12.1.0.106)\n",
            "Requirement already satisfied: nvidia-nccl-cu12==2.20.5 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (2.20.5)\n",
            "Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (12.1.105)\n",
            "Requirement already satisfied: triton==2.3.0 in /usr/local/lib/python3.10/dist-packages (from torch==2.3.0->vllm) (2.3.0)\n",
            "Requirement already satisfied: nvidia-nvjitlink-cu12 in /usr/local/lib/python3.10/dist-packages (from nvidia-cusolver-cu12==11.4.5.107->torch==2.3.0->vllm) (12.5.40)\n",
            "Requirement already satisfied: starlette<1.0.0,>=0.30.0 in /usr/local/lib/python3.10/dist-packages (from prometheus-fastapi-instrumentator>=7.0.0->vllm) (0.37.2)\n",
            "Requirement already satisfied: annotated-types>=0.4.0 in /usr/local/lib/python3.10/dist-packages (from pydantic>=2.0->vllm) (0.7.0)\n",
            "Requirement already satisfied: pydantic-core==2.18.2 in /usr/local/lib/python3.10/dist-packages (from pydantic>=2.0->vllm) (2.18.2)\n",
            "Requirement already satisfied: click>=7.0 in /usr/local/lib/python3.10/dist-packages (from ray>=2.9->vllm) (8.1.7)\n",
            "Requirement already satisfied: msgpack<2.0.0,>=1.0.0 in /usr/local/lib/python3.10/dist-packages (from ray>=2.9->vllm) (1.0.8)\n",
            "Requirement already satisfied: protobuf!=3.19.5,>=3.15.3 in /usr/local/lib/python3.10/dist-packages (from ray>=2.9->vllm) (3.20.3)\n",
            "Requirement already satisfied: aiosignal in /usr/local/lib/python3.10/dist-packages (from ray>=2.9->vllm) (1.3.1)\n",
            "Requirement already satisfied: frozenlist in /usr/local/lib/python3.10/dist-packages (from ray>=2.9->vllm) (1.4.1)\n",
            "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->vllm) (3.3.2)\n",
            "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->vllm) (3.7)\n",
            "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->vllm) (2.0.7)\n",
            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->vllm) (2024.2.2)\n",
            "Requirement already satisfied: huggingface-hub<1.0,>=0.16.4 in /usr/local/lib/python3.10/dist-packages (from tokenizers>=0.19.1->vllm) (0.23.1)\n",
            "Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.40.0->vllm) (0.4.3)\n",
            "Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers>=4.40.0->vllm) (4.66.4)\n",
            "Requirement already satisfied: fastapi-cli>=0.0.2 in /usr/local/lib/python3.10/dist-packages (from fastapi->vllm) (0.0.4)\n",
            "Requirement already satisfied: httpx>=0.23.0 in /usr/local/lib/python3.10/dist-packages (from fastapi->vllm) (0.27.0)\n",
            "Requirement already satisfied: python-multipart>=0.0.7 in /usr/local/lib/python3.10/dist-packages (from fastapi->vllm) (0.0.9)\n",
            "Requirement already satisfied: ujson!=4.0.2,!=4.1.0,!=4.2.0,!=4.3.0,!=5.0.0,!=5.1.0,>=4.0.1 in /usr/local/lib/python3.10/dist-packages (from fastapi->vllm) (5.10.0)\n",
            "Requirement already satisfied: orjson>=3.2.1 in /usr/local/lib/python3.10/dist-packages (from fastapi->vllm) (3.10.3)\n",
            "Requirement already satisfied: email_validator>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from fastapi->vllm) (2.1.1)\n",
            "Requirement already satisfied: h11>=0.8 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]->vllm) (0.14.0)\n",
            "Requirement already satisfied: httptools>=0.5.0 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]->vllm) (0.6.1)\n",
            "Requirement already satisfied: python-dotenv>=0.13 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]->vllm) (1.0.1)\n",
            "Requirement already satisfied: uvloop!=0.15.0,!=0.15.1,>=0.14.0 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]->vllm) (0.19.0)\n",
            "Requirement already satisfied: watchfiles>=0.13 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]->vllm) (0.21.0)\n",
            "Requirement already satisfied: websockets>=10.4 in /usr/local/lib/python3.10/dist-packages (from uvicorn[standard]->vllm) (12.0)\n",
            "Requirement already satisfied: anyio<5,>=3.5.0 in /usr/local/lib/python3.10/dist-packages (from openai->vllm) (3.7.1)\n",
            "Requirement already satisfied: distro<2,>=1.7.0 in /usr/lib/python3/dist-packages (from openai->vllm) (1.7.0)\n",
            "Requirement already satisfied: sniffio in /usr/local/lib/python3.10/dist-packages (from openai->vllm) (1.3.1)\n",
            "Requirement already satisfied: exceptiongroup in /usr/local/lib/python3.10/dist-packages (from anyio<5,>=3.5.0->openai->vllm) (1.2.1)\n",
            "Requirement already satisfied: dnspython>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from email_validator>=2.0.0->fastapi->vllm) (2.6.1)\n",
            "Requirement already satisfied: typer>=0.12.3 in /usr/local/lib/python3.10/dist-packages (from fastapi-cli>=0.0.2->fastapi->vllm) (0.12.3)\n",
            "Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.10/dist-packages (from httpx>=0.23.0->fastapi->vllm) (1.0.5)\n",
            "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->outlines==0.0.34->vllm) (2.1.5)\n",
            "Requirement already satisfied: attrs>=22.2.0 in /usr/local/lib/python3.10/dist-packages (from jsonschema->outlines==0.0.34->vllm) (23.2.0)\n",
            "Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /usr/local/lib/python3.10/dist-packages (from jsonschema->outlines==0.0.34->vllm) (2023.12.1)\n",
            "Requirement already satisfied: rpds-py>=0.7.1 in /usr/local/lib/python3.10/dist-packages (from jsonschema->outlines==0.0.34->vllm) (0.18.1)\n",
            "Requirement already satisfied: llvmlite<0.42,>=0.41.0dev0 in /usr/local/lib/python3.10/dist-packages (from numba->outlines==0.0.34->vllm) (0.41.1)\n",
            "Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch==2.3.0->vllm) (1.3.0)\n",
            "Requirement already satisfied: shellingham>=1.3.0 in /usr/local/lib/python3.10/dist-packages (from typer>=0.12.3->fastapi-cli>=0.0.2->fastapi->vllm) (1.5.4)\n",
            "Requirement already satisfied: rich>=10.11.0 in /usr/local/lib/python3.10/dist-packages (from typer>=0.12.3->fastapi-cli>=0.0.2->fastapi->vllm) (13.7.1)\n",
            "Requirement already satisfied: markdown-it-py>=2.2.0 in /usr/local/lib/python3.10/dist-packages (from rich>=10.11.0->typer>=0.12.3->fastapi-cli>=0.0.2->fastapi->vllm) (3.0.0)\n",
            "Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /usr/local/lib/python3.10/dist-packages (from rich>=10.11.0->typer>=0.12.3->fastapi-cli>=0.0.2->fastapi->vllm) (2.16.1)\n",
            "Requirement already satisfied: mdurl~=0.1 in /usr/local/lib/python3.10/dist-packages (from markdown-it-py>=2.2.0->rich>=10.11.0->typer>=0.12.3->fastapi-cli>=0.0.2->fastapi->vllm) (0.1.2)\n"
          ]
        }
      ],
      "source": [
        "!pip install vllm"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wn2lC5hVPUxy"
      },
      "source": [
        "## Inference\n",
        "\n",
        "### Import vLLM"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "q8o4A6QKPp3d"
      },
      "outputs": [],
      "source": [
        "from vllm import LLM"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "76mRtotdPu9N"
      },
      "source": [
        "Instantiate vLLM's engine (note that Colab's free T4 GPU only supports `float32`)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "k8y8SD1XPzAr"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
            "  warnings.warn(\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "a45c956964f441619463ea8cb6ecad5a",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "config.json:   0%|          | 0.00/627 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "INFO 05-24 06:23:04 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='google/gemma-2b', speculative_config=None, tokenizer='google/gemma-2b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float32, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=google/gemma-2b)\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "128bba1e8f4b4e779f4c77bae0f969b2",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "tokenizer_config.json:   0%|          | 0.00/33.6k [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "92c14362d70b4dc78c443f7397fa1d71",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "tokenizer.model:   0%|          | 0.00/4.24M [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "0605043aa7a445d889338f8aab5d8b6f",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "tokenizer.json:   0%|          | 0.00/17.5M [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "99bb3e9715ea463b959d4c7b3a75fcc1",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "special_tokens_map.json:   0%|          | 0.00/636 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "b03b15b10a0049398d9566685e84fa91",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "INFO 05-24 06:23:06 utils.py:660] Found nccl from library /root/.config/vllm/nccl/cu12/libnccl.so.2.18.1\n",
            "INFO 05-24 06:23:08 selector.py:69] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.\n",
            "INFO 05-24 06:23:08 selector.py:32] Using XFormers backend.\n",
            "WARNING 05-24 06:23:10 gemma.py:54] Gemma's activation function was incorrectly set to exact GeLU in the config JSON file when it was initially released. Changing the activation function to approximate GeLU (`gelu_pytorch_tanh`). If you want to use the legacy `gelu`, edit the config JSON to set `hidden_activation=gelu` instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.\n",
            "INFO 05-24 06:23:10 weight_utils.py:199] Using model weights format ['*.safetensors']\n"
          ]
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "163763c99f904eaf80620a6837169a56",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "model-00001-of-00002.safetensors:   0%|          | 0.00/4.95G [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "aefe997ca83949628c822f46320ff4ed",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "model-00002-of-00002.safetensors:   0%|          | 0.00/67.1M [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "INFO 05-24 06:24:08 model_runner.py:175] Loading model weights took 9.3440 GB\n",
            "INFO 05-24 06:24:16 gpu_executor.py:114] # GPU blocks: 1724, # CPU blocks: 7281\n",
            "INFO 05-24 06:24:19 model_runner.py:937] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n",
            "INFO 05-24 06:24:19 model_runner.py:941] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.\n",
            "INFO 05-24 06:24:29 model_runner.py:1017] Graph capturing finished in 11 secs.\n"
          ]
        }
      ],
      "source": [
        "llm = LLM(model=\"google/gemma-2b\", dtype=\"float32\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MMg2unvIOtH4"
      },
      "source": [
        "Run inference with a single prompt"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "_JihVtFsOwsn"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  1.42it/s]"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\n",
            "Prompt: The capital of USA is, Gemma output:  planning to invest in sustainable transportation and is ready to accept transformational advances in technology\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "\n"
          ]
        }
      ],
      "source": [
        "prompt = \"The capital of USA is\"\n",
        "output = llm.generate(prompt)\n",
        "print()\n",
        "print(f\"Prompt: {prompt}, Gemma output: {output[0].outputs[0].text}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "FlLNziE93xdP"
      },
      "source": [
        "Run batch inference"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "AW-ex-Fgs_La"
      },
      "outputs": [],
      "source": [
        "prompts = [\n",
        "    \"The best thing about California is\",\n",
        "    \"Physics studies\",\n",
        "    \"The best place for sightseeing in Japan is\",\n",
        "]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "R8MPTh5XtH0Z"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "Processed prompts: 100%|██████████| 3/3 [00:00<00:00,  3.59it/s]"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\n",
            "Prompt: The best thing about California is, Gemma output:  that there are always new ways to spend time and enjoy the magnificent state as a\n",
            "\n",
            "Prompt: Physics studies, Gemma output:  matter, its properties, how it interacts with other matter, and other fields of\n",
            "\n",
            "Prompt: The best place for sightseeing in Japan is, Gemma output:  Kyoto. It is the very first city you will visit in Japan when you get\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "\n"
          ]
        }
      ],
      "source": [
        "outputs = llm.generate(prompts)\n",
        "\n",
        "for output in outputs:\n",
        "    prompt = output.prompt\n",
        "    generated_text = output.outputs[0].text\n",
        "    print()\n",
        "    print(f\"Prompt: {prompt}, Gemma output: {generated_text}\")"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "name": "[Gemma_2]Deploy_with_vLLM.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
